
L'une des étapes importantes dans le processus d'investissement de toute entreprise est l'identification des marchés potentiels suceptibles de générer des retours sur investissement positif. Toute entreprise avant de se lancer dans un marché se doit d'étudier ce marché afin de déceller les opportunités présentes. Dans le cadre d'un investissement dans le domaine éducatif sur le plan international, il sied de savoir:.
Telles sont les questions fondamentales que nous allons resoudre dans notre travail. Il consiste à partir des données sur l'éducation de la banque mondiale méner une analyse afin de guider une entreprise dans son programme principalement d'expansion.
%matplotlib inline
import re
from os import path, getenv, environ
from typing import Dict
import folium
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import plotly.express as px
import scipy.stats as stats
import seaborn as sns
from sklearn import set_config
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
set_config(display="diagram", print_changed_only=False)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# Recupération de la localisation des fichiers pour facilitr le travail
ROOT_DIR = path.dirname(path.realpath("__file__"))
ROOT_DIR
'C:\\Users\\sareg\\Python Project'
# Importation de la base de données sous avec la fonction path.join qui prend en argument la localisation
# et le fichier des données
filepath = path.join(ROOT_DIR, "education_stats.csv")
# ce fichier recuperé sera affecté à la varaible data et le separateur est point virgule.
data = pd.read_csv(filepath, sep=";")
# Utilisation de shape pour afficher la taille( nombres de ligne, nombre de colonnes) de data
print("The data shape is: {shape}".format(shape=data.shape))
The data shape is: (2904, 30)
Ainsi la base comporte 2904 observations(lignes) et 30 colonnes ( variables)
# Appelation de la fonction head pour afficher les premières lignes de data
data.head()
| Country Name | Country Code | Indicator Name | Indicator Code | 2020 | Short Name | Table Name | Long Name | 2-alpha code | Currency Unit | Special Notes | Region | Income Group | WB-2 code | System of National Accounts | Alternative conversion factor | PPP survey year | Series Code | Topic | Short definition | Long definition | Unit of measure | Periodicity | Base Period | Other notes | Aggregation method | Limitations and exceptions | Notes from original source | General comments | Source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arab World | ARB | Expenditure on education as % of total government expenditure (%) | SE.XPD.TOTL.GB.ZS | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | SE.XPD.TOTL.GB.ZS | Expenditures | NaN | Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 1 | Arab World | ARB | GDP per capita, PPP (constant 2011 international $) | NY.GDP.PCAP.PP.KD | 11450.86079 | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | NY.GDP.PCAP.PP.KD | Economic Policy & Debt: Purchasing power parity | NaN | GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. | NaN | Annual | 2011.0 | NaN | Weighted average | NaN | NaN | NaN | World Bank, International Comparison Program database. |
| 2 | Arab World | ARB | Government expenditure on post-secondary non-tertiary education as % of GDP (%) | UIS.XGDP.4.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.4.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on post-secondary non-tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 3 | Arab World | ARB | Government expenditure on secondary education as % of GDP (%) | UIS.XGDP.23.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.23.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on secondary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 4 | Arab World | ARB | Government expenditure on tertiary education as % of GDP (%) | UIS.XGDP.56.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.56.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
La base de données contient plusieurs variables. Nous avons entre autres la variable indicator Name qui indique le pays correspondant, une variable qui renseigne sur les indicateurs. Ainsi pour chaque pays on donne un certains nombre d'indicateurs tels que l'utilisation de l'internet, les depenses du gouverneent dans l'education le PIB par tete. La valeur de l'indicateur est localisée dans la variable 2020. Ces données concernent l'année 2020. Par ailleurs, la base contient des données manquantes.
# Affichage des informations sur les variables( les types: float, object, character,...)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2904 entries, 0 to 2903 Data columns (total 30 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country Name 2904 non-null object 1 Country Code 2904 non-null object 2 Indicator Name 2904 non-null object 3 Indicator Code 2904 non-null object 4 2020 1588 non-null float64 5 Short Name 2892 non-null object 6 Table Name 2892 non-null object 7 Long Name 2892 non-null object 8 2-alpha code 2856 non-null object 9 Currency Unit 2580 non-null object 10 Special Notes 1740 non-null object 11 Region 2568 non-null object 12 Income Group 2568 non-null object 13 WB-2 code 2880 non-null object 14 System of National Accounts 2580 non-null object 15 Alternative conversion factor 564 non-null object 16 PPP survey year 1740 non-null object 17 Series Code 2662 non-null object 18 Topic 2662 non-null object 19 Short definition 242 non-null object 20 Long definition 2662 non-null object 21 Unit of measure 0 non-null float64 22 Periodicity 726 non-null object 23 Base Period 242 non-null float64 24 Other notes 242 non-null object 25 Aggregation method 726 non-null object 26 Limitations and exceptions 484 non-null object 27 Notes from original source 0 non-null float64 28 General comments 484 non-null object 29 Source 2662 non-null object dtypes: float64(4), object(26) memory usage: 680.8+ KB
# Stat descriptive des variables pour avoir une vue globale des variables
data.describe(include="all")
| Country Name | Country Code | Indicator Name | Indicator Code | 2020 | Short Name | Table Name | Long Name | 2-alpha code | Currency Unit | Special Notes | Region | Income Group | WB-2 code | System of National Accounts | Alternative conversion factor | PPP survey year | Series Code | Topic | Short definition | Long definition | Unit of measure | Periodicity | Base Period | Other notes | Aggregation method | Limitations and exceptions | Notes from original source | General comments | Source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2904 | 2904 | 2904 | 2904 | 1.588000e+03 | 2892 | 2892 | 2892 | 2856 | 2580 | 1740 | 2568 | 2568 | 2880 | 2580 | 564 | 1740 | 2662 | 2662 | 242 | 2662 | 0.0 | 726 | 242.0 | 242 | 726 | 484 | 0.0 | 484 | 2662 |
| unique | 242 | 242 | 12 | 12 | NaN | 241 | 241 | 241 | 238 | 152 | 131 | 7 | 5 | 240 | 3 | 32 | 3 | 11 | 8 | 1 | 11 | NaN | 1 | NaN | 1 | 2 | 2 | NaN | 2 | 4 |
| top | Estonia | TUN | GDP per capita, PPP (constant 2011 international $) | SE.XPD.TOTL.GB.ZS | NaN | Iraq | Estonia | Republic of Moldova | RW | Euro | April 2012 database update: Based on official government statistics, national accounts data were revised for 2000 onward; the base year changed to 2006. | Europe & Central Asia | Upper middle income | RW | Country uses the 1993 System of National Accounts methodology. | 1990–95 | 2005 | SE.XPD.TOTL.GB.ZS | Expenditures | The percentage of population (age 25 and over) with at least completed upper secondary education (ISCED 3 or higher). This indicator is calculated by dividing the number of persons aged 25 years and above with completed upper secondary education by the total population of the same age group and multiplying the result by 100. The UNESCO Institute for Statistics (UIS) educational attainment dataset shows the educational composition of the population aged 25 years and above and hence the stock and quality of human capital within a country. The dataset also reflects the structure and performance of the education system and its accumulated impact on human capital formation. For more information, visit the UNESCO Institute for Statistics website: http://www.uis.unesco.org/ | GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. | NaN | Annual | NaN | Cumulative Attainment | Weighted average | Operators have traditionally been the main source of telecommunications data, so information on subscriptions has been widely available for most countries. This gives a general idea of access, but a more precise measure is the penetration rate - the share of households with access to telecommunications. During the past few years more information on information and communication technology use has become available from household and business surveys. Also important are data on actual use of telecommunications services. Ideally, statistics on telecommunications (and other information and communications technologies) should be compiled for all three measures: subscriptions, access, and use. The quality of data varies among reporting countries as a result of differences in regulations covering data provision and availability.\n\nDiscrepancies may also arise in cases where the end of a fiscal year differs from that used by ITU, which is the end of December of every year. A number of countries have fiscal years that end in March or June of every year. | NaN | Please cite the International Telecommunication Union for third-party use of these data. | UNESCO Institute for Statistics |
| freq | 12 | 12 | 242 | 242 | NaN | 12 | 12 | 12 | 12 | 276 | 72 | 684 | 660 | 12 | 1980 | 96 | 1176 | 242 | 968 | 242 | 242 | NaN | 726 | NaN | 242 | 484 | 242 | NaN | 242 | 1936 |
| mean | NaN | NaN | NaN | NaN | 2.592391e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | 2.664621e+08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 6.537010e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 5.044388e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 2.746509e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | 6.118075e+09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
Les noms des variabales contiennent des espaces, il faut renommer pour faciliter la manipulation.
# Renommer les variables contenant des espaces avec la fonction replace ( espace devient tiret de huit)
# Rename columns: replace space & - by _
data.columns = data.columns.str.lower().str.replace('[-\s]', '_', regex=True)
data.head()
| country_name | country_code | indicator_name | indicator_code | 2020 | short_name | table_name | long_name | 2_alpha_code | currency_unit | special_notes | region | income_group | wb_2_code | system_of_national_accounts | alternative_conversion_factor | ppp_survey_year | series_code | topic | short_definition | long_definition | unit_of_measure | periodicity | base_period | other_notes | aggregation_method | limitations_and_exceptions | notes_from_original_source | general_comments | source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arab World | ARB | Expenditure on education as % of total government expenditure (%) | SE.XPD.TOTL.GB.ZS | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | SE.XPD.TOTL.GB.ZS | Expenditures | NaN | Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 1 | Arab World | ARB | GDP per capita, PPP (constant 2011 international $) | NY.GDP.PCAP.PP.KD | 11450.86079 | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | NY.GDP.PCAP.PP.KD | Economic Policy & Debt: Purchasing power parity | NaN | GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. | NaN | Annual | 2011.0 | NaN | Weighted average | NaN | NaN | NaN | World Bank, International Comparison Program database. |
| 2 | Arab World | ARB | Government expenditure on post-secondary non-tertiary education as % of GDP (%) | UIS.XGDP.4.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.4.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on post-secondary non-tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 3 | Arab World | ARB | Government expenditure on secondary education as % of GDP (%) | UIS.XGDP.23.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.23.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on secondary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 4 | Arab World | ARB | Government expenditure on tertiary education as % of GDP (%) | UIS.XGDP.56.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.56.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
# Vue globale sur les régions représentées à l'aide d'un diagramme circulaire( chaque région avec son nombre de pays: en pourcentage)
(data.region
.value_counts(dropna=False)
.plot(kind="pie", autopct='%.1f%%', legend = False, fontsize=10,
xlabel="", ylabel="", table=False, figsize=(5, 5))
);
Toutes les zones sont représentées. Mais il existe une région nan qui contient 11,6% de pays. Ce sont des erreurs d'observations cela veut dire que ces pays n'ont pas de régions.
# Affichage des pays correspondant à des régions manquantes à l'aide de la fonction loc. Cette fonction nous affiche les pays sans région
data.loc[data.region.isnull(), "country_name"].unique()
array(['Arab World', 'East Asia & Pacific',
'East Asia & Pacific (excluding high income)', 'Euro area',
'Europe & Central Asia',
'Europe & Central Asia (excluding high income)', 'European Union',
'Heavily indebted poor countries (HIPC)', 'High income',
'Latin America & Caribbean',
'Latin America & Caribbean (excluding high income)',
'Least developed countries: UN classification',
'Low & middle income', 'Low income', 'Lower middle income',
'Middle East & North Africa',
'Middle East & North Africa (excluding high income)',
'Middle income', 'North America', 'OECD members', 'South Asia',
'Sub-Saharan Africa', 'Sub-Saharan Africa (excluding high income)',
'Upper middle income', 'World', 'British Virgin Islands',
'Gibraltar', 'Nauru'], dtype=object)
# Suppression de tous les pays sans region( avec region manquante)
data = data.loc[data.region.notnull()]
Dans cette partie, il sera question d'effectuer une analyse globale de notre jeu de données. Nous allons nous interesser particulierement à:
data.describe(include="all")
| country_name | country_code | indicator_name | indicator_code | 2020 | short_name | table_name | long_name | 2_alpha_code | currency_unit | special_notes | region | income_group | wb_2_code | system_of_national_accounts | alternative_conversion_factor | ppp_survey_year | series_code | topic | short_definition | long_definition | unit_of_measure | periodicity | base_period | other_notes | aggregation_method | limitations_and_exceptions | notes_from_original_source | general_comments | source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2568 | 2568 | 2568 | 2568 | 1.436000e+03 | 2568 | 2568 | 2568 | 2532 | 2568 | 1440 | 2568 | 2568 | 2556 | 2568 | 564 | 1740 | 2354 | 2354 | 214 | 2354 | 0.0 | 642 | 214.0 | 214 | 642 | 428 | 0.0 | 428 | 2354 |
| unique | 214 | 214 | 12 | 12 | NaN | 214 | 214 | 214 | 211 | 151 | 106 | 7 | 5 | 213 | 3 | 32 | 3 | 11 | 8 | 1 | 11 | NaN | 1 | NaN | 1 | 2 | 2 | NaN | 2 | 4 |
| top | Estonia | TUN | Government expenditure on secondary education as % of GDP (%) | SE.XPD.TOTL.GB.ZS | NaN | Estonia | Estonia | Republic of Iraq | RW | Euro | April 2012 database update: Based on official government statistics, national accounts data were revised for 2000 onward; the base year changed to 2006. | Europe & Central Asia | Upper middle income | RW | Country uses the 1993 System of National Accounts methodology. | 1990–95 | 2005 | SE.XPD.TOTL.GB.ZS | Expenditures | The percentage of population (age 25 and over) with at least completed upper secondary education (ISCED 3 or higher). This indicator is calculated by dividing the number of persons aged 25 years and above with completed upper secondary education by the total population of the same age group and multiplying the result by 100. The UNESCO Institute for Statistics (UIS) educational attainment dataset shows the educational composition of the population aged 25 years and above and hence the stock and quality of human capital within a country. The dataset also reflects the structure and performance of the education system and its accumulated impact on human capital formation. For more information, visit the UNESCO Institute for Statistics website: http://www.uis.unesco.org/ | GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. | NaN | Annual | NaN | Cumulative Attainment | Weighted average | Operators have traditionally been the main source of telecommunications data, so information on subscriptions has been widely available for most countries. This gives a general idea of access, but a more precise measure is the penetration rate - the share of households with access to telecommunications. During the past few years more information on information and communication technology use has become available from household and business surveys. Also important are data on actual use of telecommunications services. Ideally, statistics on telecommunications (and other information and communications technologies) should be compiled for all three measures: subscriptions, access, and use. The quality of data varies among reporting countries as a result of differences in regulations covering data provision and availability.\n\nDiscrepancies may also arise in cases where the end of a fiscal year differs from that used by ITU, which is the end of December of every year. A number of countries have fiscal years that end in March or June of every year. | NaN | Please cite the International Telecommunication Union for third-party use of these data. | UNESCO Institute for Statistics |
| freq | 12 | 12 | 214 | 214 | NaN | 12 | 12 | 12 | 12 | 276 | 72 | 684 | 660 | 12 | 1980 | 96 | 1176 | 214 | 856 | 214 | 214 | NaN | 642 | NaN | 214 | 428 | 214 | NaN | 214 | 1712 |
| mean | NaN | NaN | NaN | NaN | 4.246925e+06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | 4.586961e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 5.699508e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 4.871977e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 2.327380e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | 1.262645e+09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
# Observation de la distribution des valeurs manquantes par colonne (histogramme)
msno.bar(data, figsize=(20, 7), fontsize=10);
#Distribution des valeurs manquantes par la fonction matrix: elle nous indique les positions des valeurs manquantes
msno.matrix(data, figsize=(25, 10), fontsize=10);
### Fréquence de la variable income group (histogramme par groupe de revenu)
data.income_group.value_counts().plot(kind="barh");
# filtrer low income group: Ce sont des pays à RNB(Revenu national Brut) par habitant inferieur ou égal à 1000 dollars selon la
# banque mondiale
# Nous allons recupérer ces pays et les écarter de l'analyse car ils n'ont pas de fort potentiel éducatif
LOW_INCOME_FILTER = data.income_group.str.contains("low", regex=True, flags=re.I, na=False)
data.loc[LOW_INCOME_FILTER, "country_name"].unique()
array(['Afghanistan', 'Armenia', 'Bangladesh', 'Benin', 'Bhutan',
'Bolivia', 'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia',
'Cameroon', 'Central African Republic', 'Chad', 'Comoros',
'Congo, Dem. Rep.', 'Congo, Rep.', "Cote d'Ivoire", 'Djibouti',
'Egypt, Arab Rep.', 'El Salvador', 'Eritrea', 'Ethiopia',
'Gambia, The', 'Georgia', 'Ghana', 'Guatemala', 'Guinea',
'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'India',
'Indonesia', 'Kenya', 'Kiribati', 'Korea, Dem. People’s Rep.',
'Kosovo', 'Kyrgyz Republic', 'Lao PDR', 'Lesotho', 'Liberia',
'Madagascar', 'Malawi', 'Mali', 'Mauritania',
'Micronesia, Fed. Sts.', 'Moldova', 'Mongolia', 'Morocco',
'Mozambique', 'Myanmar', 'Nepal', 'Nicaragua', 'Niger', 'Nigeria',
'Pakistan', 'Papua New Guinea', 'Paraguay', 'Philippines',
'Rwanda', 'Samoa', 'Sao Tome and Principe', 'Senegal',
'Sierra Leone', 'Solomon Islands', 'Somalia', 'South Sudan',
'Sri Lanka', 'Sudan', 'Swaziland', 'Syrian Arab Republic',
'Tajikistan', 'Tanzania', 'Timor-Leste', 'Togo', 'Uganda',
'Ukraine', 'Uzbekistan', 'Vanuatu', 'Vietnam',
'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe'],
dtype=object)
# Ne garder que les pays n'ayant pas de faible RNB par habitant
data = data.loc[~LOW_INCOME_FILTER]
print(f"data shape: {data.shape}")
data.head()
data shape: (1896, 30)
| country_name | country_code | indicator_name | indicator_code | 2020 | short_name | table_name | long_name | 2_alpha_code | currency_unit | special_notes | region | income_group | wb_2_code | system_of_national_accounts | alternative_conversion_factor | ppp_survey_year | series_code | topic | short_definition | long_definition | unit_of_measure | periodicity | base_period | other_notes | aggregation_method | limitations_and_exceptions | notes_from_original_source | general_comments | source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arab World | ARB | Expenditure on education as % of total government expenditure (%) | SE.XPD.TOTL.GB.ZS | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | SE.XPD.TOTL.GB.ZS | Expenditures | NaN | Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 1 | Arab World | ARB | GDP per capita, PPP (constant 2011 international $) | NY.GDP.PCAP.PP.KD | 11450.86079 | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | NY.GDP.PCAP.PP.KD | Economic Policy & Debt: Purchasing power parity | NaN | GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. | NaN | Annual | 2011.0 | NaN | Weighted average | NaN | NaN | NaN | World Bank, International Comparison Program database. |
| 2 | Arab World | ARB | Government expenditure on post-secondary non-tertiary education as % of GDP (%) | UIS.XGDP.4.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.4.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on post-secondary non-tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 3 | Arab World | ARB | Government expenditure on secondary education as % of GDP (%) | UIS.XGDP.23.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.23.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on secondary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
| 4 | Arab World | ARB | Government expenditure on tertiary education as % of GDP (%) | UIS.XGDP.56.FSGOV | NaN | Arab World | Arab World | Arab World | 1A | NaN | Arab World aggregate. Arab World is composed of members of the League of Arab States. | NaN | NaN | 1A | NaN | NaN | NaN | UIS.XGDP.56.FSGOV | Expenditures | NaN | Total general (local, regional and central) government expenditure on tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UNESCO Institute for Statistics |
data.describe(include="all")
| country_name | country_code | indicator_name | indicator_code | 2020 | short_name | table_name | long_name | 2_alpha_code | currency_unit | special_notes | region | income_group | wb_2_code | system_of_national_accounts | alternative_conversion_factor | ppp_survey_year | series_code | topic | short_definition | long_definition | unit_of_measure | periodicity | base_period | other_notes | aggregation_method | limitations_and_exceptions | notes_from_original_source | general_comments | source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1896 | 1896 | 1896 | 1896 | 1.037000e+03 | 1884 | 1884 | 1884 | 1860 | 1572 | 1140 | 1560 | 1560 | 1872 | 1572 | 240 | 996 | 1738 | 1738 | 158 | 1738 | 0.0 | 474 | 158.0 | 158 | 474 | 316 | 0.0 | 316 | 1738 |
| unique | 158 | 158 | 12 | 12 | NaN | 157 | 157 | 157 | 155 | 87 | 86 | 7 | 3 | 156 | 3 | 16 | 3 | 11 | 8 | 1 | 11 | NaN | 1 | NaN | 1 | 2 | 2 | NaN | 2 | 4 |
| top | Arab World | ARB | Expenditure on education as % of total government expenditure (%) | SE.XPD.TOTL.GB.ZS | NaN | Arab World | Arab World | Arab World | 1A | Euro | April 2012 database update: Based on official government statistics, national accounts data were revised for 2000 onward; the base year changed to 2006. | Europe & Central Asia | Upper middle income | 1A | Country uses the 1993 System of National Accounts methodology. | 1987–95 | Rolling | SE.XPD.TOTL.GB.ZS | Expenditures | The percentage of population (age 25 and over) with at least completed upper secondary education (ISCED 3 or higher). This indicator is calculated by dividing the number of persons aged 25 years and above with completed upper secondary education by the total population of the same age group and multiplying the result by 100. The UNESCO Institute for Statistics (UIS) educational attainment dataset shows the educational composition of the population aged 25 years and above and hence the stock and quality of human capital within a country. The dataset also reflects the structure and performance of the education system and its accumulated impact on human capital formation. For more information, visit the UNESCO Institute for Statistics website: http://www.uis.unesco.org/ | Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | Annual | NaN | Cumulative Attainment | Weighted average | Operators have traditionally been the main source of telecommunications data, so information on subscriptions has been widely available for most countries. This gives a general idea of access, but a more precise measure is the penetration rate - the share of households with access to telecommunications. During the past few years more information on information and communication technology use has become available from household and business surveys. Also important are data on actual use of telecommunications services. Ideally, statistics on telecommunications (and other information and communications technologies) should be compiled for all three measures: subscriptions, access, and use. The quality of data varies among reporting countries as a result of differences in regulations covering data provision and availability.\n\nDiscrepancies may also arise in cases where the end of a fiscal year differs from that used by ITU, which is the end of December of every year. A number of countries have fiscal years that end in March or June of every year. | NaN | Please cite the International Telecommunication Union for third-party use of these data. | UNESCO Institute for Statistics |
| freq | 12 | 12 | 158 | 158 | NaN | 12 | 12 | 12 | 12 | 264 | 72 | 588 | 660 | 12 | 1248 | 48 | 444 | 158 | 632 | 158 | 158 | NaN | 474 | NaN | 158 | 316 | 158 | NaN | 158 | 1264 |
| mean | NaN | NaN | NaN | NaN | 3.705797e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | 3.274435e+08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 1.670000e-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 9.728620e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 5.565753e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 7.931259e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | 6.118075e+09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
#Ce sont les indicateurs qui nous serviront dans la suite, pour notre analyse
#Affichage de la liste des indicateurs( noms et code par ordre croissant de code)
(data
.loc[:, ["indicator_name", "indicator_code"]]
.drop_duplicates()
.sort_values(by="indicator_code")
.reset_index(drop=True)
)
| indicator_name | indicator_code | |
|---|---|---|
| 0 | Internet users (per 100 people) | IT.NET.USER.P2 |
| 1 | GDP per capita, PPP (constant 2011 international $) | NY.GDP.PCAP.PP.KD |
| 2 | Gross enrolment ratio, upper secondary, both sexes (%) | SE.SEC.ENRR.UP |
| 3 | Gross enrolment ratio, tertiary, both sexes (%) | SE.TER.ENRR |
| 4 | Expenditure on education as % of total government expenditure (%) | SE.XPD.TOTL.GB.ZS |
| 5 | Population, ages 15-64 (% of total) | SP.POP.1564.TO.ZS |
| 6 | Population, total | SP.POP.TOTL |
| 7 | UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total | UIS.EA.3T6.AG25T99 |
| 8 | Gross enrolment ratio, post-secondary non-tertiary, both sexes (%) | UIS.GER.4 |
| 9 | Government expenditure on secondary education as % of GDP (%) | UIS.XGDP.23.FSGOV |
| 10 | Government expenditure on post-secondary non-tertiary education as % of GDP (%) | UIS.XGDP.4.FSGOV |
| 11 | Government expenditure on tertiary education as % of GDP (%) | UIS.XGDP.56.FSGOV |
# description du nombre d'indicateurs par pays
df_nb_indicators_by_country = data.groupby(["region", "country_name", "country_code"],
as_index=False, dropna=False)["2020"].count()
df_nb_indicators_by_country.describe(include="all")
| region | country_name | country_code | 2020 | |
|---|---|---|---|---|
| count | 130 | 158 | 158 | 158.000000 |
| unique | 7 | 158 | 158 | NaN |
| top | Europe & Central Asia | American Samoa | ASM | NaN |
| freq | 49 | 1 | 1 | NaN |
| mean | NaN | NaN | NaN | 6.563291 |
| std | NaN | NaN | NaN | 2.774850 |
| min | NaN | NaN | NaN | 0.000000 |
| 25% | NaN | NaN | NaN | 5.000000 |
| 50% | NaN | NaN | NaN | 6.000000 |
| 75% | NaN | NaN | NaN | 8.750000 |
| max | NaN | NaN | NaN | 12.000000 |
df_nb_indicators_by_country.plot(kind="hist", title="Distribution - nombre d'indicateur par pays");
# La carte des pays selon le nombre d'indicateurs disponible
px.choropleth(df_nb_indicators_by_country,
locations="country_code",
color="2020",
# hover_name="country_name",
hover_data=["region", "country_name", "country_code"],
animation_frame=None,
color_continuous_scale='Plasma',
height=500,
width=None,
title="Number of indicator by country",
#range_color=(0, 12),
)
# On va utiliser le quantile d'ordre 1(25%) du nombre d'indicateur disponoble pour filtrer les pays
q1 = df_nb_indicators_by_country.quantile(.25).values[0]
q1
5.0
# countries that have less than 5 indicators, ces pays ne contiennent pas suffisament d'indicateurs pour l'analyse
less_than_1rs_quartile_indicators = df_nb_indicators_by_country.loc[df_nb_indicators_by_country["2020"] <= q1,
"country_name"]
display(less_than_1rs_quartile_indicators)
data = data.loc[~data.country_name.isin(less_than_1rs_quartile_indicators)]
0 American Samoa 5 French Polynesia 6 Guam 7 Hong Kong SAR, China 12 Marshall Islands 13 New Caledonia 15 Northern Mariana Islands 16 Palau 17 Singapore 20 Tuvalu 22 Andorra 25 Belarus 27 Bosnia and Herzegovina 29 Channel Islands 35 Faroe Islands 40 Greenland 44 Isle of Man 48 Liechtenstein 52 Monaco 53 Montenegro 60 San Marino 61 Serbia 68 Turkmenistan 77 Cayman Islands 82 Curacao 83 Dominica 86 Grenada 91 Puerto Rico 92 Sint Maarten (Dutch part) 95 St. Martin (French part) 97 Suriname 99 Turks and Caicos Islands 102 Virgin Islands (U.S.) 111 Libya 117 United Arab Emirates 118 Bermuda 121 Maldives 125 Gabon 131 British Virgin Islands 138 Gibraltar 150 Nauru Name: country_name, dtype: object
data.describe(include="all")
| country_name | country_code | indicator_name | indicator_code | 2020 | short_name | table_name | long_name | 2_alpha_code | currency_unit | special_notes | region | income_group | wb_2_code | system_of_national_accounts | alternative_conversion_factor | ppp_survey_year | series_code | topic | short_definition | long_definition | unit_of_measure | periodicity | base_period | other_notes | aggregation_method | limitations_and_exceptions | notes_from_original_source | general_comments | source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1404 | 1404 | 1404 | 1404 | 9.150000e+02 | 1404 | 1404 | 1404 | 1392 | 1104 | 972 | 1104 | 1104 | 1392 | 1104 | 192 | 900 | 1287 | 1287 | 117 | 1287 | 0.0 | 351 | 117.0 | 117 | 351 | 234 | 0.0 | 234 | 1287 |
| unique | 117 | 117 | 12 | 12 | NaN | 117 | 117 | 117 | 116 | 72 | 74 | 6 | 3 | 116 | 3 | 13 | 3 | 11 | 8 | 1 | 11 | NaN | 1 | NaN | 1 | 2 | 2 | NaN | 2 | 4 |
| top | Arab World | ARB | Expenditure on education as % of total government expenditure (%) | SE.XPD.TOTL.GB.ZS | NaN | Arab World | Arab World | Arab World | 1A | Euro | April 2012 database update: Based on official government statistics, national accounts data were revised for 2000 onward; the base year changed to 2006. | Europe & Central Asia | Upper middle income | 1A | Country uses the 1993 System of National Accounts methodology. | 1987–95 | Rolling | SE.XPD.TOTL.GB.ZS | Expenditures | The percentage of population (age 25 and over) with at least completed upper secondary education (ISCED 3 or higher). This indicator is calculated by dividing the number of persons aged 25 years and above with completed upper secondary education by the total population of the same age group and multiplying the result by 100. The UNESCO Institute for Statistics (UIS) educational attainment dataset shows the educational composition of the population aged 25 years and above and hence the stock and quality of human capital within a country. The dataset also reflects the structure and performance of the education system and its accumulated impact on human capital formation. For more information, visit the UNESCO Institute for Statistics website: http://www.uis.unesco.org/ | Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ | NaN | Annual | NaN | Cumulative Attainment | Weighted average | Operators have traditionally been the main source of telecommunications data, so information on subscriptions has been widely available for most countries. This gives a general idea of access, but a more precise measure is the penetration rate - the share of households with access to telecommunications. During the past few years more information on information and communication technology use has become available from household and business surveys. Also important are data on actual use of telecommunications services. Ideally, statistics on telecommunications (and other information and communications technologies) should be compiled for all three measures: subscriptions, access, and use. The quality of data varies among reporting countries as a result of differences in regulations covering data provision and availability.\n\nDiscrepancies may also arise in cases where the end of a fiscal year differs from that used by ITU, which is the end of December of every year. A number of countries have fiscal years that end in March or June of every year. | NaN | Please cite the International Telecommunication Union for third-party use of these data. | UNESCO Institute for Statistics |
| freq | 12 | 12 | 117 | 117 | NaN | 12 | 12 | 12 | 12 | 204 | 48 | 432 | 480 | 12 | 936 | 48 | 408 | 117 | 468 | 117 | 117 | NaN | 351 | NaN | 117 | 234 | 117 | NaN | 117 | 936 |
| mean | NaN | NaN | NaN | NaN | 4.194025e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | 3.483212e+08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 1.670000e-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 8.327495e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 5.159572e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 1.917860e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | 6.118075e+09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011.0 | NaN | NaN | NaN | NaN | NaN | NaN |
sns.catplot(data=data, y='region', x='2020', col='indicator_code',
height=5, aspect=1.5, palette="husl", orient="h",
col_wrap=2, kind="box", sharex=False, sharey=True);
# topn (n=10) of country by indicator: On a ainsi le top des 10 pays par indicateur
TOPN = 10
df_country_by_indicator = (data
.groupby(['indicator_name'], as_index=True, dropna=False)
[['region', 'country_name', '2020']]
.apply(lambda x: x.nlargest(TOPN, columns=['2020'], keep="all"))
.reset_index()
)
df_country_by_indicator
| indicator_name | level_1 | region | country_name | 2020 | |
|---|---|---|---|---|---|
| 0 | Expenditure on education as % of total government expenditure (%) | 2496 | Latin America & Caribbean | St. Vincent and the Grenadines | 30.789200 |
| 1 | Expenditure on education as % of total government expenditure (%) | 2604 | East Asia & Pacific | Thailand | 28.388599 |
| 2 | Expenditure on education as % of total government expenditure (%) | 2664 | Middle East & North Africa | Tunisia | 25.047310 |
| 3 | Expenditure on education as % of total government expenditure (%) | 2640 | East Asia & Pacific | Tonga | 22.955860 |
| 4 | Expenditure on education as % of total government expenditure (%) | 1932 | Sub-Saharan Africa | Namibia | 21.925060 |
| ... | ... | ... | ... | ... | ... |
| 115 | UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total | 2291 | Middle East & North Africa | Saudi Arabia | 30.066601 |
| 116 | UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total | 503 | Latin America & Caribbean | Barbados | 24.205730 |
| 117 | UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total | 1787 | Middle East & North Africa | Malta | 20.398439 |
| 118 | UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total | 1823 | Sub-Saharan Africa | Mauritius | 20.398439 |
| 119 | UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total | 1835 | Latin America & Caribbean | Mexico | 18.577311 |
120 rows × 5 columns
#Plot des pays dans le top10
g = sns.catplot(data=df_country_by_indicator,
y="country_name",
x="2020",
kind="bar",
orient="h",
height=5,
aspect=1.7,
hue="region",
col="indicator_name",
col_wrap=2,
sharey=False,
sharex=False,
)
g.set_axis_labels("", "")
<seaborn.axisgrid.FacetGrid at 0x25aa1b75250>
# Nuage de poins des pays pour chaque variable
sns.catplot(data=data,
x="2020",
col="indicator_code",
kind="strip",
orient="h",
height=4,
aspect=1.5,
col_wrap=2,
sharex=False,
color="blue",
);
La disposition des variables ne permet pas une bonne analyse statistique car ce sont les indicateurs qui constituent nos variables danalyse. Il faut donc que les indicateurs soient en colonne et les pays en ligne. Ainsi la commande pivot de la balise pandas sera utilisée. Elle fonctionne exactement comme la commande reshape sur stata. Ainsi nous obtenons une base qui nous renseigne pour chaque pays ou zone les différents indicateurs. La commande pivot nous a facilité ici la tache au lieu de procéder à des manipulation longues. Implicitement, nous allons éliminer plusieurs colonnes qui ne contenaient jusque là que des métadonnées(des informations sur les calcul des indicateurs).
# Create a spreadsheet-style pivot table as a DataFrame.
data_pivot = pd.pivot_table(data=data,
index=["region", "country_name", "country_code"],
values='2020',
columns='indicator_code').reset_index()
print(f"Data shape: {data_pivot.shape}")
data_pivot.head()
Data shape: (92, 15)
| indicator_code | region | country_name | country_code | IT.NET.USER.P2 | NY.GDP.PCAP.PP.KD | SE.SEC.ENRR.UP | SE.TER.ENRR | SE.XPD.TOTL.GB.ZS | SP.POP.1564.TO.ZS | SP.POP.TOTL | UIS.EA.3T6.AG25T99 | UIS.GER.4 | UIS.XGDP.23.FSGOV | UIS.XGDP.4.FSGOV | UIS.XGDP.56.FSGOV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | East Asia & Pacific | Australia | AUS | 46.756116 | 35281.395362 | 249.607498 | 67.03891 | 13.360160 | 66.788576 | 1.915300e+07 | NaN | 61.390049 | 1.88884 | 0.07649 | 1.14633 |
| 1 | East Asia & Pacific | Brunei Darussalam | BRN | 8.996285 | 82049.580860 | 67.462151 | 12.69113 | 8.920740 | 66.955147 | 3.332410e+05 | NaN | 1.296230 | NaN | NaN | NaN |
| 2 | East Asia & Pacific | China | CHN | 1.775913 | 3700.743648 | 38.959030 | 7.72048 | NaN | 68.462574 | 1.262645e+09 | NaN | 5.307240 | NaN | NaN | NaN |
| 3 | East Asia & Pacific | Fiji | FJI | 1.496855 | 6673.847391 | 59.116379 | NaN | 20.484921 | 61.538812 | 8.112230e+05 | NaN | 2.885000 | NaN | NaN | 0.84457 |
| 4 | East Asia & Pacific | Japan | JPN | 29.990740 | 33871.843545 | 100.454224 | 48.73653 | 9.930410 | 68.232188 | 1.268430e+08 | NaN | 0.844730 | 1.44606 | NaN | 0.54858 |
# Description de la nouvelle base
data_pivot.describe(include="all")
| indicator_code | region | country_name | country_code | IT.NET.USER.P2 | NY.GDP.PCAP.PP.KD | SE.SEC.ENRR.UP | SE.TER.ENRR | SE.XPD.TOTL.GB.ZS | SP.POP.1564.TO.ZS | SP.POP.TOTL | UIS.EA.3T6.AG25T99 | UIS.GER.4 | UIS.XGDP.23.FSGOV | UIS.XGDP.4.FSGOV | UIS.XGDP.56.FSGOV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 92 | 92 | 92 | 91.000000 | 90.000000 | 84.000000 | 69.000000 | 63.000000 | 91.000000 | 9.200000e+01 | 11.000000 | 56.000000 | 46.000000 | 25.000000 | 47.000000 |
| unique | 6 | 92 | 92 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | Europe & Central Asia | Australia | AUS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 36 | 1 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | 13.456444 | 23297.936342 | 81.920940 | 38.041727 | 14.193053 | 64.749554 | 3.592016e+07 | 36.865614 | 16.628987 | 1.814862 | 0.087523 | 0.983590 |
| std | NaN | NaN | NaN | 14.729495 | 18997.157329 | 35.704867 | 19.611261 | 5.344908 | 4.236421 | 1.359502e+08 | 21.497669 | 14.870619 | 0.587587 | 0.108619 | 0.455717 |
| min | NaN | NaN | NaN | 0.105046 | 3508.905125 | 8.165120 | 3.234750 | 4.964300 | 50.130735 | 4.537400e+04 | 15.675070 | 0.335280 | 0.397710 | 0.001670 | 0.059390 |
| 25% | NaN | NaN | NaN | 3.180408 | 10260.927062 | 59.539659 | 22.362659 | 10.375395 | 62.227768 | 1.645501e+06 | 20.398439 | 5.921795 | 1.420477 | 0.023290 | 0.741090 |
| 50% | NaN | NaN | NaN | 6.731396 | 15654.111790 | 80.134777 | 37.147419 | 13.360160 | 65.567704 | 6.736625e+06 | 30.066601 | 12.091540 | 1.831825 | 0.047550 | 0.869290 |
| 75% | NaN | NaN | NaN | 17.225234 | 34689.708142 | 98.578098 | 54.428761 | 16.599045 | 67.859249 | 2.379614e+07 | 50.060844 | 26.575469 | 2.111583 | 0.127030 | 1.214570 |
| max | NaN | NaN | NaN | 52.000000 | 108323.903919 | 249.607498 | 82.439072 | 30.789200 | 72.588714 | 1.262645e+09 | 71.459663 | 61.390049 | 2.970210 | 0.536130 | 2.425410 |
#Visualisons à nouveau les valeurs manquantes de la base
msno.bar(data_pivot);
msno.matrix(data_pivot);
#liste des variables(indicteurs) de data_pivot
indicators = ['IT.NET.USER.P2', # internet
"NY.GDP.PCAP.PP.KD", # GDP
"SP.POP.1564.TO.ZS", # population
# enrolment
"SE.SEC.ENRR.UP",
"SE.TER.ENRR",
"UIS.GER.4",
# expenditure
"SE.XPD.TOTL.GB.ZS",
"UIS.XGDP.23.FSGOV",
"UIS.XGDP.4.FSGOV",
"UIS.XGDP.56.FSGOV",
]
len(indicators)
10
# Création de nouvelles variables
# total_enrolment= moyenne des trois taux de scolarisation( secondaire,post-secondaire non-supérieur, supérieur)
# gov_expenditure= somme des dépenses du gouvernement( secondaire,post-secondaire non-supérieur, supérieur)
# Renommer les variables(indicateurs) SE.XPD.TOTL.GB.ZS ; IT.NET.USER.P2 ; NY.GDP.PCAP.PP.KD .
data_pivot = (data_pivot
.assign(total_enrolment=lambda dframe: dframe[["SE.SEC.ENRR.UP", "SE.TER.ENRR", "UIS.GER.4"]
].mean(axis=1),
gov_expenditure=lambda dframe: (dframe["UIS.XGDP.23.FSGOV"]
+ dframe["UIS.XGDP.4.FSGOV"]
+ dframe["UIS.XGDP.56.FSGOV"]
),
)
.rename(columns={"SE.XPD.TOTL.GB.ZS": "percent_education_total_gov_expenditure",
"IT.NET.USER.P2": "internet_users",
"NY.GDP.PCAP.PP.KD": "gdp_per_capita"})
)
data_pivot.head()
| indicator_code | region | country_name | country_code | internet_users | gdp_per_capita | SE.SEC.ENRR.UP | SE.TER.ENRR | percent_education_total_gov_expenditure | SP.POP.1564.TO.ZS | SP.POP.TOTL | UIS.EA.3T6.AG25T99 | UIS.GER.4 | UIS.XGDP.23.FSGOV | UIS.XGDP.4.FSGOV | UIS.XGDP.56.FSGOV | total_enrolment | gov_expenditure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | East Asia & Pacific | Australia | AUS | 46.756116 | 35281.395362 | 249.607498 | 67.03891 | 13.360160 | 66.788576 | 1.915300e+07 | NaN | 61.390049 | 1.88884 | 0.07649 | 1.14633 | 126.012152 | 3.11166 |
| 1 | East Asia & Pacific | Brunei Darussalam | BRN | 8.996285 | 82049.580860 | 67.462151 | 12.69113 | 8.920740 | 66.955147 | 3.332410e+05 | NaN | 1.296230 | NaN | NaN | NaN | 27.149837 | NaN |
| 2 | East Asia & Pacific | China | CHN | 1.775913 | 3700.743648 | 38.959030 | 7.72048 | NaN | 68.462574 | 1.262645e+09 | NaN | 5.307240 | NaN | NaN | NaN | 17.328917 | NaN |
| 3 | East Asia & Pacific | Fiji | FJI | 1.496855 | 6673.847391 | 59.116379 | NaN | 20.484921 | 61.538812 | 8.112230e+05 | NaN | 2.885000 | NaN | NaN | 0.84457 | 31.000689 | NaN |
| 4 | East Asia & Pacific | Japan | JPN | 29.990740 | 33871.843545 | 100.454224 | 48.73653 | 9.930410 | 68.232188 | 1.268430e+08 | NaN | 0.844730 | 1.44606 | NaN | 0.54858 | 50.011828 | NaN |
# Visualisation ( carte des variables )
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_geo = f'{url}/world-countries.json'
for col_name in ['internet_users',
"gdp_per_capita",
"total_enrolment",
"gov_expenditure",
"percent_education_total_gov_expenditure",
]:
m = folium.Map(location=[0, 0], zoom_start=1)
folium.Choropleth(geo_data=country_geo,
data=data_pivot,
columns=['country_code', col_name],
key_on='feature.id',
name="choropleth",
fill_color='BuPu',
fill_opacity=0.7,
line_opacity=0.2,
legend_name=col_name,
nan_fill_color="black",
highlight=True,
).add_to(m)
display(m)
# Affichage des premières lignes de data_pivot
data_pivot.head()
| indicator_code | region | country_name | country_code | internet_users | gdp_per_capita | SE.SEC.ENRR.UP | SE.TER.ENRR | percent_education_total_gov_expenditure | SP.POP.1564.TO.ZS | SP.POP.TOTL | UIS.EA.3T6.AG25T99 | UIS.GER.4 | UIS.XGDP.23.FSGOV | UIS.XGDP.4.FSGOV | UIS.XGDP.56.FSGOV | total_enrolment | gov_expenditure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | East Asia & Pacific | Australia | AUS | 46.756116 | 35281.395362 | 249.607498 | 67.03891 | 13.360160 | 66.788576 | 1.915300e+07 | NaN | 61.390049 | 1.88884 | 0.07649 | 1.14633 | 126.012152 | 3.11166 |
| 1 | East Asia & Pacific | Brunei Darussalam | BRN | 8.996285 | 82049.580860 | 67.462151 | 12.69113 | 8.920740 | 66.955147 | 3.332410e+05 | NaN | 1.296230 | NaN | NaN | NaN | 27.149837 | NaN |
| 2 | East Asia & Pacific | China | CHN | 1.775913 | 3700.743648 | 38.959030 | 7.72048 | NaN | 68.462574 | 1.262645e+09 | NaN | 5.307240 | NaN | NaN | NaN | 17.328917 | NaN |
| 3 | East Asia & Pacific | Fiji | FJI | 1.496855 | 6673.847391 | 59.116379 | NaN | 20.484921 | 61.538812 | 8.112230e+05 | NaN | 2.885000 | NaN | NaN | 0.84457 | 31.000689 | NaN |
| 4 | East Asia & Pacific | Japan | JPN | 29.990740 | 33871.843545 | 100.454224 | 48.73653 | 9.930410 | 68.232188 | 1.268430e+08 | NaN | 0.844730 | 1.44606 | NaN | 0.54858 | 50.011828 | NaN |
Dans cette partie, nous allons construire un score d'attractivité. Celui-ci nous permettra d'ordonner les pays en fonction de leur attractivité en terme d'investissement dans l'éducation.
Pour ce faire nous allons utiliser deux methodes, la première est une pondération et la deuxième est une analyse en composante principale.
# Ici on s'intéresse aux variables clés : le nombre d'utilisateurs d'internet, le PIB par habitant,
# le taux moyen de scolarisation et les dépenses totales du gouvernement dans l'éducation(en %).
# La liste des indicateurs pour le score d'attractivité
list_features = ["internet_users",
"gdp_per_capita",
"total_enrolment",
"percent_education_total_gov_expenditure",
]
# Définition de la fonction Scoring
def scoring(dframe: pd.DataFrame,
features_weight: Dict[str, float]
) -> float:
""" Country attractivity score
Args:
dframe (pd.DataFrame): data frame
Returns:
float: attractivity score
"""
score = 0
for col_name, weight in features_weight.items():
score += weight * dframe[col_name]
return score
# internet users stats
internet_stat = data_pivot["internet_users"].describe(percentiles=np.arange(0, 1, 0.1))
internet_stat
count 92.000000 mean 13.355744 std 14.680149 min 0.105046 0% 0.105046 10% 1.511643 20% 2.774729 30% 3.693737 40% 5.542655 50% 6.688139 60% 8.597771 70% 14.869559 80% 23.066165 90% 42.688489 max 52.000000 Name: internet_users, dtype: float64
# filter out all coutries that have internet_users less thant 80% decile
data_custom = data_pivot.loc[data_pivot["internet_users"] >= internet_stat.loc["70%"], :].copy()
data_custom.country_name.unique()
array(['Australia', 'Japan', 'Korea, Rep.', 'Malaysia', 'New Zealand',
'Austria', 'Belgium', 'Cyprus', 'Denmark', 'Estonia', 'Finland',
'Germany', 'Iceland', 'Ireland', 'Italy', 'Luxembourg',
'Netherlands', 'Norway', 'Portugal', 'Slovenia', 'Sweden',
'Switzerland', 'United Kingdom', 'Aruba', 'Chile', 'Israel',
'Canada', 'United States'], dtype=object)
data_custom.head()
| indicator_code | region | country_name | country_code | internet_users | gdp_per_capita | SE.SEC.ENRR.UP | SE.TER.ENRR | percent_education_total_gov_expenditure | SP.POP.1564.TO.ZS | SP.POP.TOTL | UIS.EA.3T6.AG25T99 | UIS.GER.4 | UIS.XGDP.23.FSGOV | UIS.XGDP.4.FSGOV | UIS.XGDP.56.FSGOV | total_enrolment | gov_expenditure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | East Asia & Pacific | Australia | AUS | 46.756116 | 35281.395362 | 249.607498 | 67.038910 | 13.360160 | 66.788576 | 19153000.0 | NaN | 61.390049 | 1.88884 | 0.07649 | 1.14633 | 126.012152 | 3.11166 |
| 4 | East Asia & Pacific | Japan | JPN | 29.990740 | 33871.843545 | 100.454224 | 48.736530 | 9.930410 | 68.232188 | 126843000.0 | NaN | 0.844730 | 1.44606 | NaN | 0.54858 | 50.011828 | NaN |
| 5 | East Asia & Pacific | Korea, Rep. | KOR | 44.700000 | 20756.779897 | 95.884689 | 78.436157 | 20.484921 | 72.199592 | 47008111.0 | 63.714600 | NaN | NaN | NaN | NaN | 87.160423 | NaN |
| 7 | East Asia & Pacific | Malaysia | MYS | 21.384731 | 16309.967988 | 45.810822 | 25.742260 | 21.390680 | 62.719942 | 23185608.0 | 36.407089 | 19.770599 | 2.06063 | 0.17907 | 1.91465 | 30.441227 | 4.15435 |
| 8 | East Asia & Pacific | New Zealand | NZL | 47.379557 | 28264.565116 | 123.198380 | 66.192177 | 20.484921 | 65.468325 | 3857700.0 | NaN | 33.464970 | 2.54104 | 0.08818 | NaN | 74.285175 | NaN |
# higher value, higher percentile rank
data_custom.loc[:, list_features] = data_custom.loc[:, list_features
].rank(pct=True, ascending=True, na_option="keep") * 100
data_custom.head()
| indicator_code | region | country_name | country_code | internet_users | gdp_per_capita | SE.SEC.ENRR.UP | SE.TER.ENRR | percent_education_total_gov_expenditure | SP.POP.1564.TO.ZS | SP.POP.TOTL | UIS.EA.3T6.AG25T99 | UIS.GER.4 | UIS.XGDP.23.FSGOV | UIS.XGDP.4.FSGOV | UIS.XGDP.56.FSGOV | total_enrolment | gov_expenditure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | East Asia & Pacific | Australia | AUS | 85.714286 | 53.571429 | 249.607498 | 67.038910 | 57.142857 | 66.788576 | 19153000.0 | NaN | 61.390049 | 1.88884 | 0.07649 | 1.14633 | 100.000000 | 3.11166 |
| 4 | East Asia & Pacific | Japan | JPN | 50.000000 | 42.857143 | 100.454224 | 48.736530 | 7.142857 | 68.232188 | 126843000.0 | NaN | 0.844730 | 1.44606 | NaN | 0.54858 | 21.428571 | NaN |
| 5 | East Asia & Pacific | Korea, Rep. | KOR | 78.571429 | 17.857143 | 95.884689 | 78.436157 | 91.071429 | 72.199592 | 47008111.0 | 63.714600 | NaN | NaN | NaN | NaN | 89.285714 | NaN |
| 7 | East Asia & Pacific | Malaysia | MYS | 28.571429 | 14.285714 | 45.810822 | 25.742260 | 96.428571 | 62.719942 | 23185608.0 | 36.407089 | 19.770599 | 2.06063 | 0.17907 | 1.91465 | 3.571429 | 4.15435 |
| 8 | East Asia & Pacific | New Zealand | NZL | 92.857143 | 32.142857 | 123.198380 | 66.192177 | 91.071429 | 65.468325 | 3857700.0 | NaN | 33.464970 | 2.54104 | 0.08818 | NaN | 78.571429 | NaN |
# Il s'agit d'affecter des coefficients( poids) à chaque variable en fonction de son importance que nous considerons.
# custom weighted
weigthed = {"internet_users": .4,
"total_enrolment": .25,
"gdp_per_capita": .25,
"percent_education_total_gov_expenditure": .1,
}
data_custom = (data_custom
.assign(attractivity_score_custom=
lambda dframe: dframe.apply(scoring, features_weight=weigthed, axis=1))
.reset_index()
)
data_custom[list_features + ["attractivity_score_custom"]].head()
| indicator_code | internet_users | gdp_per_capita | total_enrolment | percent_education_total_gov_expenditure | attractivity_score_custom |
|---|---|---|---|---|---|
| 0 | 85.714286 | 53.571429 | 100.000000 | 57.142857 | 78.392857 |
| 1 | 50.000000 | 42.857143 | 21.428571 | 7.142857 | 36.785714 |
| 2 | 78.571429 | 17.857143 | 89.285714 | 91.071429 | 67.321429 |
| 3 | 28.571429 | 14.285714 | 3.571429 | 96.428571 | 25.535714 |
| 4 | 92.857143 | 32.142857 | 78.571429 | 91.071429 | 73.928571 |
top_country_custom = data_custom.nlargest(TOPN, columns=["attractivity_score_custom"], keep="all")
g = sns.catplot(data=top_country_custom,
kind="bar", orient="h", hue=None,
y="country_name", x="attractivity_score_custom", estimator=np.mean,
height=5, aspect=1.4, palette="Greens_r",
)
g.ax.set_title("Selected countries");
Dans cette partie nous allons mener une méthode d'analyse en composante principale
fig = px.scatter_3d(data_frame=data_pivot,
x="internet_users",
y="gdp_per_capita",
z='total_enrolment',
color='region',
hover_name="country_name",
symbol='region',
opacity=0.7,
width=800,
height=500,
)
# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
# Définition de la fonction fillna( elle permettra d'imputer les valeurs manquantes par la médiane)
#
def fillna_by_group(data, groupby, columns, estimator="median"):
""" Fill na by group
Args:
data (pd.DataFrame):
groupby (Union[str, Sequence[str]]):
columns (Union[str, Sequence[str]]):
estimator (str):
Return:
pd.DataFrame
"""
return data_pivot[columns].fillna(data_pivot.groupby(groupby)[columns].transform(estimator))
# Imputer les valeurs manquantes
data_pivot[list_features] = fillna_by_group(data=data_pivot,
groupby="region",
columns=list_features,
estimator="median",
)
data_pivot.head()
| indicator_code | region | country_name | country_code | internet_users | gdp_per_capita | SE.SEC.ENRR.UP | SE.TER.ENRR | percent_education_total_gov_expenditure | SP.POP.1564.TO.ZS | SP.POP.TOTL | UIS.EA.3T6.AG25T99 | UIS.GER.4 | UIS.XGDP.23.FSGOV | UIS.XGDP.4.FSGOV | UIS.XGDP.56.FSGOV | total_enrolment | gov_expenditure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | East Asia & Pacific | Australia | AUS | 46.756116 | 35281.395362 | 249.607498 | 67.03891 | 13.360160 | 66.788576 | 1.915300e+07 | NaN | 61.390049 | 1.88884 | 0.07649 | 1.14633 | 126.012152 | 3.11166 |
| 1 | East Asia & Pacific | Brunei Darussalam | BRN | 8.996285 | 82049.580860 | 67.462151 | 12.69113 | 8.920740 | 66.955147 | 3.332410e+05 | NaN | 1.296230 | NaN | NaN | NaN | 27.149837 | NaN |
| 2 | East Asia & Pacific | China | CHN | 1.775913 | 3700.743648 | 38.959030 | 7.72048 | 20.484921 | 68.462574 | 1.262645e+09 | NaN | 5.307240 | NaN | NaN | NaN | 17.328917 | NaN |
| 3 | East Asia & Pacific | Fiji | FJI | 1.496855 | 6673.847391 | 59.116379 | NaN | 20.484921 | 61.538812 | 8.112230e+05 | NaN | 2.885000 | NaN | NaN | 0.84457 | 31.000689 | NaN |
| 4 | East Asia & Pacific | Japan | JPN | 29.990740 | 33871.843545 | 100.454224 | 48.73653 | 9.930410 | 68.232188 | 1.268430e+08 | NaN | 0.844730 | 1.44606 | NaN | 0.54858 | 50.011828 | NaN |
# Standardisation de la base
scaled_data = pd.DataFrame(MinMaxScaler().fit_transform(data_pivot[list_features]),
columns=list_features,
index=pd.MultiIndex.from_frame(data_pivot[["region", "country_name", "country_code"]]))
scaled_data
| internet_users | gdp_per_capita | total_enrolment | percent_education_total_gov_expenditure | |||
|---|---|---|---|---|---|---|
| region | country_name | country_code | ||||
| East Asia & Pacific | Australia | AUS | 0.898952 | 0.303129 | 1.000000 | 0.325107 |
| Brunei Darussalam | BRN | 0.171331 | 0.749327 | 0.179070 | 0.153203 | |
| China | CHN | 0.032197 | 0.001830 | 0.097519 | 0.600994 | |
| Fiji | FJI | 0.026820 | 0.030196 | 0.211046 | 0.600994 | |
| Japan | JPN | 0.575888 | 0.289681 | 0.368910 | 0.192299 | |
| ... | ... | ... | ... | ... | ... | ... |
| Sub-Saharan Africa | Equatorial Guinea | GNQ | 0.000526 | 0.069492 | 0.000954 | 0.362172 |
| Mauritius | MUS | 0.138289 | 0.072206 | 0.281844 | 0.362172 | |
| Namibia | NAM | 0.029669 | 0.024779 | 0.088199 | 0.656760 | |
| Seychelles | SYC | 0.140487 | 0.142577 | 0.286187 | 0.362172 | |
| South Africa | ZAF | 0.101041 | 0.059216 | 0.321362 | 0.362172 |
92 rows × 4 columns
Avant de réaliser l'ACP nous allons effectuer le test de sphéricité de Bartlett
# le test de Bartlett.
import scipy
from scipy.stats import bartlett
p = scipy.stats.bartlett( scaled_data.internet_users,scaled_data.gdp_per_capita,scaled_data.total_enrolment,scaled_data.percent_education_total_gov_expenditure)
p
BartlettResult(statistic=32.704415217275425, pvalue=3.7178129709993507e-07)
# Application de l'ACP sur nos variables
pca = PCA(n_components=None)
pca.fit(scaled_data)
explain_var_ratio = pca.explained_variance_ratio_
explain_var_ratio
array([0.59491999, 0.20313471, 0.11460959, 0.08733571])
# weighted: Variance expliquee de chaque composante multipliée par la coordonnee de chaque variable
weighted = sum(np.multiply(explain_var_ratio.reshape(1, -1).T, pca.components_))
# normalized weight: sum of weigth will be equal to 1
normalize_weighted = (weighted / sum(weighted)).round(2)
features_weight_stat = dict(zip(list_features, normalize_weighted))
features_weight_stat
{'internet_users': 0.43,
'gdp_per_capita': 0.23,
'total_enrolment': 0.2,
'percent_education_total_gov_expenditure': 0.15}
data_pivot[list_features]
| indicator_code | internet_users | gdp_per_capita | total_enrolment | percent_education_total_gov_expenditure |
|---|---|---|---|---|
| 0 | 46.756116 | 35281.395362 | 126.012152 | 13.360160 |
| 1 | 8.996285 | 82049.580860 | 27.149837 | 8.920740 |
| 2 | 1.775913 | 3700.743648 | 17.328917 | 20.484921 |
| 3 | 1.496855 | 6673.847391 | 31.000689 | 20.484921 |
| 4 | 29.990740 | 33871.843545 | 50.011828 | 9.930410 |
| ... | ... | ... | ... | ... |
| 87 | 0.132355 | 10792.733023 | 5.699935 | 14.317360 |
| 88 | 7.281535 | 11077.147618 | 39.526619 | 14.317360 |
| 89 | 1.644740 | 6106.108164 | 16.206575 | 21.925060 |
| 90 | 7.395629 | 18453.109771 | 40.049661 | 14.317360 |
| 91 | 5.348560 | 9715.620644 | 44.285725 | 14.317360 |
92 rows × 4 columns
scaled_data = (scaled_data
.assign(attractivity_score=
lambda dframe: dframe.apply(scoring, features_weight=features_weight_stat, axis=1))
.reset_index()
)
scaled_data
| region | country_name | country_code | internet_users | gdp_per_capita | total_enrolment | percent_education_total_gov_expenditure | attractivity_score | |
|---|---|---|---|---|---|---|---|---|
| 0 | East Asia & Pacific | Australia | AUS | 0.898952 | 0.303129 | 1.000000 | 0.325107 | 0.705035 |
| 1 | East Asia & Pacific | Brunei Darussalam | BRN | 0.171331 | 0.749327 | 0.179070 | 0.153203 | 0.304812 |
| 2 | East Asia & Pacific | China | CHN | 0.032197 | 0.001830 | 0.097519 | 0.600994 | 0.123919 |
| 3 | East Asia & Pacific | Fiji | FJI | 0.026820 | 0.030196 | 0.211046 | 0.600994 | 0.150836 |
| 4 | East Asia & Pacific | Japan | JPN | 0.575888 | 0.289681 | 0.368910 | 0.192299 | 0.416886 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 87 | Sub-Saharan Africa | Equatorial Guinea | GNQ | 0.000526 | 0.069492 | 0.000954 | 0.362172 | 0.070726 |
| 88 | Sub-Saharan Africa | Mauritius | MUS | 0.138289 | 0.072206 | 0.281844 | 0.362172 | 0.186766 |
| 89 | Sub-Saharan Africa | Namibia | NAM | 0.029669 | 0.024779 | 0.088199 | 0.656760 | 0.134611 |
| 90 | Sub-Saharan Africa | Seychelles | SYC | 0.140487 | 0.142577 | 0.286187 | 0.362172 | 0.204765 |
| 91 | Sub-Saharan Africa | South Africa | ZAF | 0.101041 | 0.059216 | 0.321362 | 0.362172 | 0.175665 |
92 rows × 8 columns
top_country = scaled_data.nlargest(TOPN, columns=["attractivity_score"], keep="all")
g = sns.catplot(data=top_country,
kind="bar", orient="h", hue=None,
y="country_name", x="attractivity_score", estimator=np.mean,
height=5, aspect=1.4, palette="Greens_r",
)
g.ax.set_title("Selected countries") ;
(pd.melt(top_country,
id_vars=["country_name", "region"],
value_vars=list_features + ["attractivity_score"],
var_name=None,
value_name='value',)
.pipe((sns.catplot, "data"), kind="swarm",
y="country_name", x="value", hue="variable", height=5, aspect=1.5
)
);
g = sns.PairGrid(top_country,
x_vars=list_features, y_vars=["country_name"],
height=6, aspect=.5,
)
# Stripplot
g.map(sns.stripplot, size=10, orient="h", jitter=False,
palette="flare_r", linewidth=1, edgecolor="w")
titles = list_features
# Labels
g.set(xlabel="", ylabel="")
for ax, title in zip(g.axes.flat, titles):
# Set a different title for each axes
ax.set(title=title)
ax.set_xticklabels([])
# Make the grid horizontal instead of vertical
ax.xaxis.grid(False)
ax.yaxis.grid(True)
sns.despine(left=True, bottom=True)
# top of region
scaled_data.groupby("region").attractivity_score.mean().nlargest(n=10)
region North America 0.620367 East Asia & Pacific 0.366177 Europe & Central Asia 0.332619 Middle East & North Africa 0.240436 Latin America & Caribbean 0.197159 Sub-Saharan Africa 0.130659 Name: attractivity_score, dtype: float64
fig= px.choropleth(scaled_data.reset_index(),
locations="country_code",
color="attractivity_score",
# hover_name="country_name",
hover_data=["region", "country_name"],
animation_frame=None,
color_continuous_scale='Plasma',
height=500,
width=None,
title="Country - Attractivity score",
#range_color=(0, 12),
)
fig.show()
#fig.write_image("images/map_attractivity.png")
#!pip install -U kaleido
#top5 of country byy Region
g = (scaled_data.groupby("region", as_index=False)
.apply(lambda g: g.nlargest(5, columns=["attractivity_score"], keep="all"))
.pipe((sns.catplot, "data"),
kind="bar", orient="h", col="region", col_wrap=2,
y="country_name", x="attractivity_score", estimator=np.mean,
height=3, aspect=1.5, palette="husl", sharex=False, sharey=False
)
)
# save fig
#plt.savefig("images/top_country_per_region.png")
C:\Users\sareg\anaconda3\lib\site-packages\seaborn\categorical.py:3808: UserWarning: Setting `sharey=False` with `color=None` may cause different levels of the `y` variable to share colors. This will change in a future version.